Conversation

@UnravelSports (Contributor) commented Dec 17, 2025

This is a continuation of stephTchembeu#5 and #513. Because of all the misalignments, it was easier to create a new PR.

Overview

This PR adds TrackingDataset.to_cdf() functionality. It incorporates @koenvo's improved writing support from PR #515 and a cleaned-up version of the work done by @stephTchembeu.

In short, we can now export kloppy tracking data to the Common Data Format (Anzer et al. 2025).

from kloppy import skillcorner

dataset = skillcorner.load_open_data(only_alive=False)

dataset.to_cdf(
    metadata_output_file='output/metadata.json',
    tracking_output_file='output/tracking.jsonl'
)

Because kloppy does not process some values that are mandatory in the CDF (stadium ID, competition ID, season ID, tracking version, and collection timing), running the above will throw warnings such as:

UserWarning: Missing mandatory ID at 'competition.id'. Currently replaced with the value 'MISSING_MANDATORY_VALUE'. Please provide the correct value to 'additional_metadata' to completely adhere to the CDF specification.

We can resolve this by passing additional_metadata to to_cdf(), using the Common Data Format Validator TypedDicts (you don't have to use these, but they help keep everything in the correct schema), like so:

from cdf.domain import CdfMetaDataSchema, Stadium, Competition, Season, Meta, Tracking

additional_meta_data = CdfMetaDataSchema(
    competition=Competition(
        id="COMP_123",
        name="Test Competition",
        format="league_20"
    ),
    season=Season(id="SEASON_2024", name="2024/25"),
    stadium=Stadium(
        id="STADIUM_456",
        name="Test Arena",
        turf="grass",
    ),
    meta=Meta(
        tracking=Tracking(
            version="2.0.0",
            name="TestTracker",
            fps=30,
            collection_timing="live"
        )
    )
)

We can then run:

from kloppy import skillcorner

dataset = skillcorner.load_open_data(only_alive=False)

dataset.to_cdf(
    metadata_output_file='output/metadata.json',
    tracking_output_file='output/tracking.jsonl',
    additional_metadata=additional_meta_data
)

This no longer throws any warnings and outputs the correct files.

Note: we set only_alive=False, because not doing so will also show a warning.

Common Data Format Validator

We have new unit tests for the writing functionality, plus tests that validate the output against the CDF schema using the common-data-format-validator. This is a development dependency. Note that if the CDF changes its structure, these tests will start failing on the kloppy side too. I can imagine this is not ideal, but I'm not sure what to do about it. Any suggestions are more than welcome.
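
As illustration, here is a minimal sketch of what such a round-trip schema test can look like; validate_metadata is a hypothetical stand-in for the validator's actual entry point, and the final assertion only checks a key implied by the warning shown earlier:

import json

from kloppy import skillcorner


def test_to_cdf_output_matches_schema(tmp_path):
    # Export the open data to CDF files in a pytest-provided temp directory
    dataset = skillcorner.load_open_data(only_alive=False)
    metadata_file = tmp_path / "metadata.json"
    tracking_file = tmp_path / "tracking.jsonl"
    dataset.to_cdf(
        metadata_output_file=str(metadata_file),
        tracking_output_file=str(tracking_file),
    )

    metadata = json.loads(metadata_file.read_text())
    # validate_metadata(metadata)  # hypothetical validator entry point
    assert "competition" in metadata  # key implied by the warning above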

Next Steps

I would like to continue with reading CDF tracking data, and with writing and reading CDF event data. Should I do this in a new PR, or shall I pile everything into this one?

stephTchembeu and others added 27 commits October 28, 2025 09:42
This adds comprehensive write support to the open_as_file() function with
efficient memory management and streaming capabilities.

Key features:
- BufferedStream: SpooledTemporaryFile wrapper with chunked I/O (5MB memory threshold)
- Write modes: 'wb' (write), 'ab' (append) - binary only
- Adapter pattern: write_from_stream() method (opt-in for adapters)
- Compression support: .gz, .bz2, .xz files handled automatically
- Local files and S3 URIs supported via FSSpecAdapter
- Protocols for type safety: SupportsRead, SupportsWrite

Implementation details:
- read_from()/write_to() methods use shutil.copyfileobj for chunked copying
- Context manager pattern buffers writes and flushes on exit
- No breaking changes to existing read functionality
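
For context, a hedged usage sketch of the write path this commit message describes; the import path and the mode keyword are assumptions based on the feature list above, not confirmed API:

from kloppy.io import open_as_file  # assumed import path

# 'wb' buffers writes in a BufferedStream and flushes them through the
# matching adapter on exit; .gz compression is inferred from the extension.
with open_as_file("output/tracking.jsonl.gz", mode="wb") as f:
    f.write(b'{"frame_id": 1}\n')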
@UnravelSports UnravelSports changed the title from "[test]" to "[CDF] to_cdf" on Dec 17, 2025
@UnravelSports UnravelSports requested review from probberechts and removed request for probberechts December 17, 2025 11:58
@UnravelSports (Contributor, Author) commented:

Added

from kloppy import cdf

dataset = cdf.load_tracking(
    meta_data="output/metadata.json",
    raw_data="output/tracking.jsonl.gz"
)

Full functionality including tests.

The implementation includes forced validation of the incoming data using the common-data-format-validator. This means that if you want to use the cdf.load_tracking functionality, you need to have it installed. If this is not preferred, I can remove it, but it could at least be informative when people first start using it.

I can also make this validation optional, of course.
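
A sketch of what that toggle could look like; the validate parameter is hypothetical and not part of the current API:

from kloppy import cdf

dataset = cdf.load_tracking(
    meta_data="output/metadata.json",
    raw_data="output/tracking.jsonl.gz",
    validate=False,  # hypothetical flag: skip the common-data-format-validator check
)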

    only_alive: Optional[bool] = True,
) -> TrackingDataset:
    """
    Load Common Data Format broadcast tracking data.
Contributor:

Does this only work for broadcast tracking data? I guess this is a copy-paste error.

@@ -0,0 +1,5 @@
"""Functions for loading SkillCorner broadcast tracking data."""
Contributor:

Suggested change:
- """Functions for loading SkillCorner broadcast tracking data."""
+ """Functions for loading data in the Common Data Format (CDF) standard."""
  • I think it would be good to add a sentence explaining the CDF with a reference to the arXiv paper.
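
For example, the module docstring could read as follows; the wording and citation format are placeholders:

"""Functions for loading data in the Common Data Format (CDF) standard.

The CDF is a provider-neutral specification for football tracking and
event data proposed by Anzer et al. (2025).
"""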

Comment on lines +1229 to +1231
base_coordinate_system: ProviderCoordinateSystem | None = None,
pitch_length: float | None = None,
pitch_width: float | None = None,
Contributor:

Should use Optional[T] here instead of T | None to be compatible with Python 3.9
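
That is, the same three parameters spelled with typing.Optional; the surrounding function name is a stand-in, and the string annotation avoids needing the real import:

from typing import Optional


def load_tracking(
    base_coordinate_system: Optional["ProviderCoordinateSystem"] = None,
    pitch_length: Optional[float] = None,
    pitch_width: Optional[float] = None,
):
    ...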


@property
def pitch_dimensions(self) -> PitchDimensions:
    return NormalizedPitchDimensions(
Contributor:

I think these should be MetricPitchDimensions

        raise KloppyParameterError(f"Engine {engine} is not valid")

    def to_cdf(self):
        if self.dataset_type != DatasetType.TRACKING:
Contributor:

Remove the if here and override in the TrackingDataset. A parent class should not be aware of its children.
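
A sketch of the suggested restructuring, with class and method bodies reduced to the dispatch pattern; names besides to_cdf and CDFTrackingSerializer are simplified stand-ins:

class Dataset:
    # The parent class no longer defines to_cdf, so it stays unaware
    # of which subclasses support CDF export.
    ...


class TrackingDataset(Dataset):
    def to_cdf(self, metadata_output_file: str, tracking_output_file: str):
        # Only the tracking subclass exposes CDF export; the
        # DatasetType.TRACKING check becomes unnecessary.
        serializer = CDFTrackingSerializer()
        ...  # serialize self into the two output files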


serializer = CDFTrackingSerializer()

# TODO: write files but also support non-local files, similar to how open_as_file supports non-local files
Contributor:

I think this is done now.

Comment on lines +336 to +360
@contextlib.contextmanager
def _write_context_manager(
    uri: str, mode: str
) -> Generator[BinaryIO, None, None]:
    """
    Context manager for write operations that buffers writes and flushes to adapter on exit.

    Args:
        uri: The destination URI
        mode: Write mode ('wb' or 'ab')

    Yields:
        A BufferedStream for writing
    """
    buffer = BufferedStream()
    try:
        yield buffer
    finally:
        adapter = get_adapter(uri)
        if adapter:
            adapter.write_from_stream(uri, buffer, mode)
        else:
            raise AdapterError(f"No adapter found for {uri}")


Contributor:

This should be removed. Something went wrong in the merge with master. It's now duplicated.

Contributor:

Why do we need a new test file? Can't we simply use the existing SkillCorner test data?

"pytest-httpserver", # Mock HTTP server for testing
"moto[s3,server]", # Mock AWS S3 for testing
"pre-commit>=4.2.0", # Git hooks for code quality checks
"common-data-format-validator==0.0.13",
Contributor:

This should not be added as a dev dependency. Create a new optional dependency group instead.
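
For example, in pyproject.toml; the group name cdf is a placeholder choice:

[project.optional-dependencies]
cdf = [
    "common-data-format-validator==0.0.13",
]

Users would then opt in with pip install kloppy[cdf].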

    pitch_dimensions=transformer.get_to_coordinate_system().pitch_dimensions,
    frame_rate=frame_rate,
    orientation=orientation,
    provider=Provider.CDF,
Contributor:

I'm not sure whether this is desired. I believe the value of this attribute should be the actual data provider, not the data-representation standard used. Doesn't the CDF store the actual data provider?
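
A hypothetical sketch of deriving the provider from the CDF metadata instead of hard-coding it; the metadata path follows the Meta/Tracking schema shown earlier, and the Provider.OTHER fallback is an assumption:

# Hypothetical: look up the kloppy Provider by the tracking provider name
# stored in the CDF metadata (e.g. meta.tracking.name == "TestTracker").
tracking_name = metadata["meta"]["tracking"]["name"]
try:
    provider = Provider[tracking_name.upper()]
except KeyError:
    provider = Provider.OTHER  # assumed fallback for unknown provider names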

@probberechts (Contributor) commented:

Can you also add some documentation on how to load CDF data and how to export a dataset to the CDF format?
